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Abstract Scoring systems are linear classification models that only require users to add, subtract and 
multiply a few small numbers in order to make a prediction. These models are in widespread use by the 
medical community, but are difficult to learn from data because they need to be accurate and sparse, 
have coprime integer coefficients, and satisfy multiple operational constraints. We present a new method 
for creating data-driven scoring systems called a Supersparse Linear Integer Model (SLIM). SLIM scoring 
systems are built by solving an integer program that directly encodes measures of accuracy (the 0-1 loss) 
and sparsity (the £o-seminorm) while restricting coefficients to coprime integers. SLIM can seamlessly 
incorporate a wide range of operational constraints related to accuracy and sparsity, and can produce 
highly tailored models without parameter tuning. We provide bounds on the testing and training accuracy 
of SLIM scoring systems, and present a new data reduction technique that can improve scalability by 
eliminating a portion of the training data beforehand. Our paper includes results from a collaboration with 
the Massachusetts General Hospital Sleep Laboratory, where SLIM was used to create a highly tailored 
scoring system for sleep apnea screening. 

1 Introduction 

Scoring systems are linear classification models that only require users to add, subtract and multiply a few 
small numbers in order to make a prediction. These models are used to assess the risk of numerous serious 
medical conditions since they allow physicians to make quick predictions, without extensive training, and 
without the use of a computer (see e.g., Knaus et ah, 1991, Bone et ah, 1992, Moreno et ah, 2005). Many 
medical scoring systems that are currently in use were hand-crafted by physicians, whereby a panel of 
experts simply agreed on a model (see e.g., the CHADS 2 score of Gage et ah, 2001). Some medical scoring 
systems are data-driven in the sense that they were created by rounding logistic regression coefficients (see 
e.g., the SAPS II score of Le Gall et ah, 1993). Despite the widespread use of medical scoring systems in 
high-stakes applications, there has been little to no work that has focused on a direct method to learn these 
models from data. 

Scoring systems are difficult to create using traditional machine learning methods because they need 
to be accurate, sparse, and use small coprime integer coefficients. This task is exceptionally challenging in 
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medical applications because models also need to satisfy explicit constraints on operational quantities such as 
the false positive rate or the number of features before they can be deployed. The sum of these requirements 
represent serious challenges for machine learning. Current methods for sparse linear classification such as 
the Lasso (Tibshirani, 1996) and Elastic Net (Zou and Hastie, 2005) control the accuracy and sparsity of 
models via convex surrogate functions to speed up computation, and require rounding to yield models with 
coprime integer coefficients. Approximations such as convex surrogate loss functions, ^i-regularization, and 
rounding not only degrade predictive performance but make it difficult to address operational constraints 
imposed by physicians. To train a model that satisfies a hard constraint on the false positive rate, for 
instance, we must compute its value explicitly, which is impossible when we control accuracy by means of 
a surrogate loss function. In practice, traditional methods are therefore only able to address operational 
constraints through a parameter tuning process that involves high-dimensional grid search. As we show, 
this approach often fails to produce a model that satisfies operational constraints, let alone a model that is 
optimized for predictive accuracy. 

In this paper, we present a new method to create data-driven scoring systems called a Supersparse Linear 
Integer Model (SLIM). SLIM is a integer programming problem that optimizes direct measures of accuracy 
(the 0-1 loss) and sparsity (the ^o-seminorm) while restricting coefficients to a small set of coprime integers. 
In comparison to current methods for sparse linear classification, SLIM can produce scoring systems that 
are fully optimized for accuracy and sparsity, and that satisfy a wide range of complicated operational 
constraints without any parameter tuning. 

The main contributions of our paper are as follows. 

• We present a principled machine learning approach to learn scoring systems from data. This approach 
can produce tailored scoring systems that satisfy multiple operational constraints without any parameter 
tuning. Further, it has a unique advantage for imbalanced classification problems, where constraints on 
class-based accuracy can be explicitly enforced. 

• We derive new bounds on the accuracy of discrete linear classification models. In particular, we present 
discretization bounds that guarantee that we will not lose training accuracy when the size of the coefficient 
set is sufficiently large. In addition, we present generalization bounds that relate the size of the coefficient 
set to a uniform guarantee on testing accuracy. 

• We develop a novel data reduction technique that can improve the scalability of supervised classification 
algorithms by removing a portion of the training data beforehand. Further, we show how data reduction 
can be applied directly to SLIM. 

• We present results from a collaboration with the Massachusetts General Hospital (MGH) Sleep Labora¬ 
tory where SLIM was used to create a highly tailored scoring system for sleep apnea screening. Screening 
for sleep apnea is important: the condition is difficult to diagnose, has significant costs, and affects over 
12 million people in the United States alone (Kapur, 2010). 

• We provide a detailed experimental comparison between SLIM and eight popular classification methods 
on publicly available datasets. Our results suggest that SLIM can produce scoring systems that are 
accurate and sparse in a matter of minutes. 

• We provide software to create SLIM scoring systems using MATLAB and the CPLEX API. 

The remainder of our paper is structured as follows. In the rest of Section 1, we discuss related work. In 
Section 2, we introduce SLIM and discuss its special properties. In Section 2.2, we explain how SLIM can 
easily enforce operational constraints that are important for medical scoring systems to be used in practice. 
In Section 3, we present theoretical bounds on the accuracy of SLIM scoring systems and other discrete 
linear classification models. In Section 4, we present a data reduction technique to decrease the computation 
associated with SLIM and other supervised classification methods. In Section 5, we discuss a collaboration 
with the MGH Sleep Laboratory where we used SLIM to create a highly tailored scoring system for sleep 
apnea screening. In Section 6, we present experimental results to show that SLIM can create high-quality 
scoring systems in minutes. In Section 7, we present specialized extensions of SLIM. 
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1.1 Related Work 

In what follows, we briefly discuss related work in medical scoring systems and linear classification. 

Medical Scoring Systems 

Medical scoring systems are sparse linear models with small coprime coefficients. Some popular examples 
include: SAPS I, II and III (Le Gall et ah, 1993, Moreno et ah, 2005) and APACHE I, II and III to assess 
ICU mortality risk (Knaus et ah, 1981, 1985, 1991); CHADS 2 to assess the risk of stroke in patients with 
atrial hbrillation (Gage et ah, 2001); and TIMI, to assess the risk of death and ischemic events (Antman 
et ah, 2000). Most of the scoring systems that are in widespread use today were built without optimizing 
for predictive accuracy. In some cases, physicians built scoring systems by combining existing methods and 
heuristics. The SAPS II score, for instance, was built by rounding logistic regression coefficients as Le Gall 
et al. (1993) write, “the general rule was to multiply the jd for each range by 10 and round off to the nearest 
integer.” This approach is at odds with the fact that rounding is known to produce suboptimal solutions in 
the field of integer programming. In other cases, scoring systems were hand-crafted by a panel of physicians, 
and not learned from data at all. This was the case for CHADS 2 as explained by Gage et al. (2001): “We 
calculated CHADS 2 , by adding 1 point each for each of the following - recent CHF, hypertension, age 75 years 
or older, and DM - and 2 points for a history of stroke or TIA.” Methods that can learn tailored predictive 
models from data, such as SLIM, should eliminate the need for physicians to build scoring systems by hand. 

To date, SLIM has already been used to create medical scoring systems for the purposes of diagnosing 
cognitive impairments using features derived from a clock-drawing test (see Souillard-Mandar et ah, 2015), 
and for screening sleep apnea from electronic health records (see Ustun et ah, 2015). 

Sparse Linear Classification Models 

In comparison to SLIM, current methods for sparse linear classihcation are designed to ht models with real 
coefficients, and need to be paired with a rounding procedure to create the same kinds of scoring systems 
used by physicians. In practice, rounding the coefficients of a linear model may significantly alter its accuracy 
and sparsity, and may result in a scoring system that violates operational constraints on these quantities. 
Current methods are also ill-suited to create scoring systems because they control accuracy and sparsity by 
means of convex surrogate functions to preserve scalability (see e.g., Tibshirani (1996), Efron et al. (2004)). 
As we show in Sections 5 and 6, surrogate functions result in a poor trade-off between accuracy and sparsity. 
Convex surrogate loss functions, for instance, produce models that are not robust to outliers (Nguyen and 
Sanner, 2013). Similarly, £i-regularization is only guaranteed to recover the correct sparse solution (i.e., 
the one that minimizes the fo-norm) under restrictive conditions that are rarely satisfied in practice (Zhao 
and Yu, 2007). In fact, £i-regularization may recover a solution with significantly less predictive accuracy 
relative to the correct sparse solution (see Lin et al. (2008) for a discussion). 

SLIM is also related to a recent body of work on methods for discrete linear classification. Specifically, 
Chevaleyre et al. (2013) consider training linear classifiers with binary coefficients by rounding the coeffi¬ 
cients of linear classffiers. In addition, Carrizosa et al. (2016) consider training linear classihers with small 
integer coefficients using a MIP formulation. SLIM can reproduce both of these models. The converse, how¬ 
ever, is not true because the methods of Chevaleyre et al. (2013) and Carrizosa et al. (2016): (i) optimize 
the hinge loss as opposed to the 0-1 loss; and (ii) do not include a mechanism to control sparsity. These 
differences may result in better scalability compared to SLIM. However, they also prevent these methods 
to create scoring systems that are sparse, that satisfy operational constraints on accuracy and/or sparsity, 
and that can be trained without parameter tuning. In addition to these differences, we note that the dis¬ 
cretization bounds and generalization bounds in Section 3 are a novel contribution to this body of work 
and applicable to all linear models with discrete coefficients. 
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2 Methodology 


We start with a dataset of N i.i.d. training examples = {(a:*, t/i)}fLi where Xi G X C denotes a 

vector of features [1, Xi^i,..., and Vi £ y = {—1,1} denotes a class label. We consider linear models 

of the form y = sign(A^a:), where A = [Aq, Ai, ..., Ap]^ represents a vector of coefficients and Aq represents 
an intercept term. We learn the coefficients by solving an optimization problem of the form: 


nhn Loss (A; Djv) + C'• ^(-^) 

s.t. A € £. 


( 1 ) 


Here: the loss function Loss (A;iDjv) ^ x {X xy)^ —>• R penalizes misclassifications; the coefficient penalty 

<?(A) : > R induces soft qualities that are desirable but may be sacrificed for greater accuracy; the 

coefficient set C encodes hard qualities must be satisfied; and the trade-off parameter C controls the balance 
between accuracy and soft qualities. We assume: (i) the coefficient set contains the null vector, 0 £ C\ (ii) 
the penalty is additively separable, ^(A) = intercept is never penalized, ^o(Ao) = 0. 

A Supersparse Linear Integer Model (SLIM) is a special case of the optimization in (1): 

N 

<oj +C'o||A||o + e|lA|l^ 

i=l ^ ' 

s.t. X £ £. 


SLIM directly optimizes accuracy and sparsity by minimizing the 0-1 loss ^ ^ oj and 

fo-norm ||A||q := ^ respectively. The constraints usually restrict coefficients to a finite set 

of discrete values such as £ = {—10,..., 10}^“*"^, and may include additional operational constraints such 
as ||A||g < 10. SLIM includes a tiny ^i-penalty e||A||j^ in the objective for the sole purpose of restricting 
coefficients to coprime values.^ To be clear, the £i-penalty parameter e is always set to a value that is 
small enough to avoid ti-regularization (that is, e is small enough to guarantee that SLIM never sacrifices 
accuracy or sparsity to attain a smaller ti-penalty). 

SLIM is designed to produce scoring systems that attain a pareto-optimal trade-off between accuracy 
and sparsity: when we minimize 0-1 loss and the £o-penalty, we only sacrifice classification accuracy to 
attain higher sparsity, and vice versa. Minimizing the 0-1 loss produces scoring systems that are completely 
robust to outliers and attain the best learning-theoretic guarantee on predictive accuracy (see e.g. Brooks, 
2011, Nguyen and Sanner, 2013). Similarly, controlling for sparsity via to-regularization prevents the ad¬ 
ditional loss in accuracy due to £i-regularization (see Lin et al., 2008, for a discussion). In addition to 
these performance benefits, minimizing an approximation-free object function over a finite set of discrete 
coefficients means that the free parameters in SLIM’s object have special properties. 

Remark 1 li e < £ is a finite subset of then the optimization of (2) will produce a 

scoring system with^oprime coefficients without affecting accuracy or^parsity: 

argmin < oj -h Co ||A||o -|- e ||A||;^ C argmin Xi < oj -h Co ||A||o 

i=l i=l 

N 

and gcd({A*}jLo) = 1 for all A* G argmin < oj -|- Co |lA||g -|- e ||A|1^ . 

i=l 


^ To illustrate the use of the £i-penalty, consider a classifier such as y = sign {xi + X2). If the objective in ( 2 ) only mini¬ 
mized the 0-1 loss and an ^o-penalty, then y = sign (2x1 + 2x2) would have the same objective value as y = sign (xi + X2) 
because it makes the same predictions and has the same number of non-zero coefficients. Since coefficients are restricted to 
a finite discrete set, we add a tiny £i-penalty in the objective of ( 2 ) so that SLIM chooses the classifier with the smallest 
(i.e. coprime) coefficients, y = sign (xi -|- X2). 
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Remark 2 The trade-off parameter Co represents the maximum accuracy that SLIM will sacrifice to remove 
a feature from the optimal scoring system. 

Remark 5 If Cn < -rhs and e < n. n then the optimization of (2) will produce a 

ny maxxecll A||j maXAec||A||j V \ 

scoring system with coefficients A € £ with the highest possible training accuracy: 

1 TV ^ N 

argmin < oj -|- Co ||A||q + e |jA||;^ C argmin < oj . 

i=i i=i 

Remark A If Co > 1 — ^ and e < n then the optimization of (2) will produce a 

^ TV maxAec||A||j maxxec || A|| ^ v v / v 

scoring system with coefficients A € £ with the highest possible sparsity: 

1 ^ 

argmin ^ 1 < 0 M-Co ||A||„ -|- e |jA||^ C argmin Co ||A|j„ . 

Ae£ ^ ^ L J xdC 

Note that these properties are only possible using the formulation in (2). In particular, Remarks 2-4 require 
that we control accuracy using the O-I loss and control sparsity using an £o-penalty, and Remark 1 requires 
that we restrict coefficients to a hnite discrete set. 


2.1 SLIM IP Formulation 


We train SLIM scoring system^using the following IP formulation: 
min ^ tpi + 

i=\ j=i 

p 


Mitpi 

> 

7-E 

yiXjXij 

i = 

= 1,. 

..,iV 

0-1 loss 



j=o 







= 

CoOij + 

ePj 

j = 

= 1,- 

..,P 

int. penalty 


< 

Xj < Aj 

<^3 

j = 

= 1,- 

..,P 

io-norm 

-P 3 

< 

Xj < Pj 


/ = 

= 1,. 

..,P 

-norm 

y 

£ 



j = 

= 0,. 

..,P 

coefficient set 

A 

£ 

{0,1} 


i = 

= 1,. 

..,N 

loss variables 


£ 

M+ 


j = 

= 1,. 

..,P 

penalty variables 

aj 

£ 

{0.1} 


3 = 

= 1,- 

..,P 

Iq variables 

Pj 

£ 

R+ 


j = 

= 1,- 

..,P 

variables 


(3a) 

{3b) 

(3c) 

(3d) 


Here, the constraints in (3a) set the loss variables ‘4ii = 1 < oJ to 1 if a linear classifier with 

coefficients A misclassifies example i. This is a Big-M constraint for the 0-1 loss that depends on scalar 
parameters 7 and M, (see e.g., Rubin, 2009). The value of represents the maximum score when example 
i is misclassihed, and can be set as Mi = maxxg £{7 — Xi) which is easy to compute since £ is hnite. 
The value of 7 represents the “margin” and should be set as a lower bound on yi\^Xi. When the features 
are binary, 7 can be set to any value between 0 and 1. In other cases, the lower bound is difficult to calculate 
exactly so we set 7 = 0.1, which makes an implicit assumption on the values of the features. The constraints 
in (3b) set the total penalty for each coefficient to <Pj = Cocxj + ej3j, where aj := 1 [Aj 0] is dehned by 
Big-M constraints in (3c), and Pj := |A_;| is dehned by the constraints in (3d). We denote the largest absolute 
value of each coefficient as Aj := max;^ j&Cj ly-i- 

Restricting coefficients to a hnite set results in signihcant practical benehts for the SLIM IP formulation, 
especially in comparison to other IP formulations that minimize the 0-1-loss and/or penalize the ^o-norm. 
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Many IP formulations compute the 0-1 loss and £o-norm by means of Big-M constraints that use require 
users to specify Big-M constants (see e.g., Wolsey, 1998). Restricting the coefficients to a hnite set allows 
us to bound Big-M constants in the SLIM IP formulation. Specifically, the Big-M constant for computing 
the 0-1 loss in constraints (3a) is bounded as Mi < maxxg £(7 — yiX^Xi) and the Big-M constant used to 
compute the ^o-norm in constraints (3c) is bounded as Aj < (compare with Brooks (2011), 

Guan et al. (2009) where the same parameters have to be approximated by a “sufficiently large” constants). 
Bounding these constants lead to a tighter LP relaxation, which narrows the integrality gap, and improves 
the ability of commercial IP solvers to quickly obtain a proof of optimality. 


2.2 Operational Constraints 

SLIM provides users with an unprecedented amount of flexibility over their models by allowing them 
to directly encode a wide range of operational constraints into its IP formulation. In what follows, we 
provide a few examples to illustrate this process. We note that these techniques are possible because: (i) 
the variables used to encode the 0-1 loss and ^o-penalty in the SLIM IP formulation can also encode 
operational constraints related to accuracy and sparsity; (i) the free parameters in the SLIM objective can 
be set without tuning (see Remarks 2-4). 


Loss Constraints for Imbalanced Data 

The majority of classification problems in the medical domain are imbalanced. Handling imbalanced data is 
incredibly difficult for most classification methods since maximizing classification accuracy often produces 
a trivial model (i.e., if the probability of heart attack is 1 %, a model that never predicts a heart attack is 
still 99% accurate). SLIM has a unique advantage on such problems as it not only avoid producing a trivial 
model, but can produce a model at any user-specihed point on the ROC curve without parameter tuning. 
That is, when physicians specify hard constraints on sensitivity (or specihcity), we can encode these as loss 
constraints into the IP formulation, and solve a single IP to obtain the least specihc (or most sensitive) 
model. To train the most sensitive scoring system with a maximum error of 7 G [0,1] on negatively-labeled 
examples we solve an IP with the form: 


min 

A 

N ^ 

^yiX’^Xi < 0 

] + ^ ^ < oj -h Go A p -h e A ^ 



zGX+ 




S.t. 


y^X^Xi > oj 

< 7 

(4) 


iGl“ 





A e r. 





This formulation optimizes a weighted 0-1 loss function where and W are user-dehned weights that 
control the accuracy on the positive examples from the set = {i ■. yi = -|-1}, and N~ negative 
examples from the set T~ = {i : yi = —1}, respectively. Assuming that + W~ = 1, we set W*" > 
so that SLIM weighs the accuracy on each positive example as much as all of the negative examples. 
In a typical setting, this would return a scoring system that classifies all positive examples correctly at 
the expense of misclassify all of the negative examples in order to classify an additional positive example 
correctly. In this case, however, the loss constraint (4) explicitly limits the error on negative examples to 7 . 
Thus, SLIM returns a scoring system that attains the highest sensitivity among models with a maximum 
error of 7 on negative examples. 
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Feature-Based Constraints for Input Variables 

SLIM provides fine-grained control over the composition of inpnt variables in a scoring system by formulating 
feature-based constraints. Specifically, we can use the indicator variables that encode the £o-norm aj := 
1 [Aj ^ 0] to formulate many logical constraint between features such as “either-or” conditions and “if-then” 
conditions (see (Wolsey, 1998) for an overview). This presents a practical alternative to create classification 
models that obey structured sparsity constraints (Jenatton et ah, 2011) or hierarchical constraints (Bien 
et ah, 2013). 

The indicator variables aj can be used to limit the number of input variables to at most 0 by adding the 
constraint, ctj < More complicated feature-based constraints include “if-then” constraints to ensure 

that a scoring system will only include hypertension and heart ^attack if it also includes stroke: a art .attack 
(^hypertension strokes Or hierarchical Constraints to ensure that an input variable in the leaves can only 
be used when all features above it in the hierarchy are also used: a;ea/ < (‘node for all nodes above the leaf. 


2.3 Feature-Based Preferences 

Physicians often have soft preferences between different input variables. SLIM allows practitioners to encode 
these preferences by specifying a distinct trade-off parameter for each coefficient Cqj. 

Explicitly, when our model should use feature j instead of feature fe, we set Co,fc = Cqj -t-d, where <5 > 0 
represents the maximum additional training accuracy that we are willing to sacrifice in order to use feature 
j instead of feature k. Thus, setting = Coy -h 0.02 would ensure that we would only be willing to use 
feature k instead of feature j if it yields an additional 2% gain in training accuracy over feature k. 

This approach can also be used to handle problems with missing data. Consider training a model where 
feature j contains M < N missing points. Instead of dropping these points, we can impute the values of 
the M missing examples, and adjust the trade-off parameter Cqj so that our model only uses feature j if 
it yields an additional gain in accuracy of more than M examples: 

Co.,=Co + ^. 

The adjustment factor is chosen so that: if M = 0 then Cqj = Co and if M = then Cqj- = 1 and the 
coefficient is dropped entirely (see Remark 4). This ensures that features with lots of imputed values are 
more heavily penalized than features with fewer imputed values. 


3 Bounds on Training and Testing Accuracy 

In this section, we present bounds on the training and testing accuracy of SLIM scoring systems. 


3.1 Discretization Bounds on Training Accuracy 

Our first result shows that we can always craft a finite discrete set of coefficients C so that the training 
accuracy of a linear classifier with discrete coefficients \ £ C (e.g. SLIM) is no worse than the training 
accuracy of a baseline linear classifier with real-valued coefficients p £ (e.g. SVM). 
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Theorem 1 (Minimum Margin Resolution Bound) Let p = [pi, ... ,pp]^ G denote the coefficients of 
a baseline linear classifier trained using data Vpf = Let Xmax = maxj |ia;j ||2 and 7min = minj 

denote the largest magnitude and minimum margin achieved by any training example, respectively. 

Consider training a linear classifier with coefficients A = [Ai,..., Ap]^ from the set L = {—A,..., A}^. If 
we choose a resolution parameter A such that: 


A > 


^maxt/R 

27min 


then there exists \ £ C such that the 0-1 loss of A is less than or equal to the 0-1 loss of p: 


(5) 


N 

1 < 

i=l 


N 

oj < 1 ^yip'^Xi < oj . 

i=l 


Proof See Appendix A. 

The proof of Theorem 1 uses a rounding procedure to choose a resolution parameter A so that the coefficient 
set C contains a classifier with discrete coefficients A that attains the same the 0-1 loss as the baseline 
classifier with real coefficients p. If the baseline classifier p is obtained by minimizing a convex surrogate 
loss, then the optimal SLIM classifier trained with the coefficient set from Theorem 1 may attain a lower 
0-1 loss than 1 \yip'^Xi < oj because SLIM directly minimizes the 0-1 loss. 

The next corollary yields additional bounds on the training accuracy by considering progressively larger 
values of the margin. These bounds can be used to relate the resolution parameter A to a worst-case 
guarantee on training accuracy. 

Corollary 1 Margin Resolution Bound) Let p = [pi,. .., pp^ € denote the coefficients of a linear 
classifier trained with data T)fq = {xi,yi)f—i. Let denote the value of the smallest margin, lL[k) denote 

the set of training examples with ' ||^|| *' < 7(fc), and X(^k) = Il^ilb denote the largest magnitude of any 

training example Xi £ Vjg for i ^ T(^k) ■ 

Consider training a linear classifier with coefficients A = [Ai,..., Ap]^ from the set C, = {—A ,..., A}^. If 
we choose a resolution parameter A such that: 


A > 


x^ky/P 


then there exists \ £ C such that the 0-1 loss of \ and the 0-1 loss of p differ by at most k — 1: 

N N 

yy 1 < oj - yy 1 ^Vip'^Xi < oj < A: - 1 . 

i=l i=l 

Proof The proof follows by applying Theorem 1 to the reduced dataset Vpj\I(^k)- 

We have now shown that good discretized solutions exist and can be constructed easily. This motivates 
that optimal discretized solutions, which by definition are better than rounded solutions, will also be good 
relative to the best non-discretized solution. 
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3.2 Generalization Bounds on Testing Accuracy 


According to the principle of structural risk minimization (Vapnik, 1998), fitting a classifier from a simpler 
class of models may lead to an improved guarantee on predictive accuracy. Consider training a classifier 
f : X ^ y with data Djv = where Xi G X C and Vi £ y = {—1,1}. In what follows, we provide 

uniform generalization guarantees on the predictive accuracy of all functions, f £ T. These guarantees bound 
the true risk [f{x) y] by the empirical risk ^ ^ Vi] other 

quantities important to the learning process. 

Theorem 2 (Occam’s Razor Bound for Discrete Linear Classifiers) Let T denote the set of linear clas¬ 
sifiers with coefficients X £ C: 

= |/ : A —>• T I f{x) = sign j and A G £| . 

For every 5 > 0, with probability at least \ — 5, every classifier f £ T obeys: 

^trne^f) < ^ ^ logdTI)^ log(j J 

A proof of Theorem 2 can be found in Section 3.4 of Bousquet et al. 2004. The result that more restrictive 
hypothesis spaces can lead to better generalization provides motivation for using discrete models without 
necessarily expecting a loss in predictive accuracy. The bound indicates that we include more coefficients 
in the set C as the amount of data N increases. 

In Theorem 3, we improve the generalization bound from Theorem 2 by excluding models that are 
provably suboptimal from the hypothesis space. Here, we exploit the fact that we can bound the number 
of non-zero coefficients in a SLIM scoring system based on the value of Cq. 

Theorem 3 (Generalization of Sparse Discrete Linear Classifiers) Let T denote the set of linear classi¬ 
fiers with coefficients X from a finite set C such that: 

.F = {/ : A ^ T I f{x) = sign (a^®) } 

1 ^ 

X £ argmin < oj -h Co ||A|jo 

1=1 


For every 5 > 0, with probability at least 1 — 5, every classifier f £ T obeys: 


j^true 


if) 


< R®”P(/) -h 


logdRp.Col) -log(5) 
2N 


where 


Hp,Co = |a£c| ||A|1o< 

Proof See Appendix A. 

This theorem relates the trade-off parameter Co in the SLIM objective to the generalization of SLIM 
scoring systems. It indicates that increasing the value of the Co parameter will produce a model with better 
generalization properties. 

In Theorem 4, we produce a better generalization bound by exploiting the fact that SLIM scoring 
systems use coprime integer coefficients (see Remark 1). In particular, we express the generalization bound 
from Theorem 2 using the P-dimensional Farey points of level A (see Marklof, 2012, for a dehnition). 
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Theorem 4 (Generalization of Discrete Linear Classifiers with Coprime Coefficients) 

the set of linear classifiers with coprime integer coefficients, A, bounded by A: 

= |/ : T — >• 3^ [ f{x) = sign x'j and A € £|, 

£ = |a G I |A^| < A for j = 

Z^ = |z G Z^ I gcd(z) = l|. 

For every 5 > 0, with probability at least 1 — <5, every classifier / G P obeys: 

^true(^) < ^ ^ \og{\C lo gffl ^ 


Let T denote 


where Cp^A denotes the set of Farey points of level A: 


Cp = I ^ G [0, 1)^ : (A, q) G Z^+l and 1 < g < yl 

The proof involves a connting argument over coprime integer vectors, using the definition of Farey points 
from number theory. 

In Figure 1, we plot the relative density of coprime integer vectors bounded by A (i.e., |Cp^yi|/(2yl+ 1)^), 
and the relative improvement in the generalization bound due to the use of coprime coefficients. We see that 
the use of coprime coefficients can significantly reduce the number of classifiers based on the dimensionality 
of the data and the value of A. The corresponding improvement in the generalization bound may be 
significant when the data are high dimensional and A is small. 



Fig. 1 : Relative density of coprime integer vectors in (left), and the relative improvement in the general¬ 
ization bound due to the use of coprime coefficients for 5 = 0.01 (right). 


4 Data Reduction 

Data reduction is a technique that can decrease the computation associated with training a supervised 
classification model by discarding redundant training data. This technique can be applied to any supervised 
classification method where the training procedure is carried out by solving an optimization problem. 
However, it is best suited for methods such as SLIM, where the underlying optimization problem may be 
difficult to solve for large instances. In this section, we first describe how data reduction works in a general 
setting, and then show how it can be applied to SLIM. 











Supersparse Linear Integer Models for Optimized Medical Scoring Systems 


11 


4.1 Data Reduction for Optimization-Based Supervised Classification 

Consider training a classifier f : X ^ y hy solving a computationally challenging optimization problem, 

mm Z{f-Visf) s.t. f G T. (6) 

We refer to the optimization problem in (6) as the original problem. Here, T represents the set of feasible 
classifiers and Z : iF x {X x y)^ —>• R represents its objective function. 

Data reduction aims to decrease the computation associated with solving the original problem by re¬ 
moving redundant examples from Hjv = (he., data points that can be safely discarded without 

changing the optimal solution to (6)). The technique requires users to specify a surrogate problem that is 
considerably easier to solve. Given the initial training data Hjv = ^md the surrogate problem, 

data reduction solves A'" -|- 1 variants of the surrogate problem to identify redundant examples. These ex¬ 
amples are then removed from the initial training data to leave behind a subset of reduced training data 
T^m Q that is guaranteed to yield the same optimal classiher as Djv- Thus, the computational gain from 
data reduction comes from training a model with Dm (i.e., solving an instance of the original problem with 
N — M fewer examples). 

We provide an overview of data reduction in Algorithm 1. To explain how the algorithm works, let us 
denote the surrogate problem as: 

mm Z{f-,DN) s.t f G F. (7) 

Here Z : F x {X x y)^ —>■ M denotes the objective function of the surrogate problem, and F denotes its set 
of feasible classifiers. Data reduction can be used with any surrogate problem so long as the e-level set of 
the surrogate problem contains all optimizers to the original problem. That is, we can use any feasible set 
F and any objective function Z{.) as long as we can specify a value of e such that 

Z{f*) < Z{f*)+e V/* G J"* and/* e j-*. (8) 

Here, f* denotes an optimal classiher to the original problem from the set F* = argminZ{f), and /* 
denotes an optimal classiher to the surrogate problem from the set F* = argmin^^^ Z{f). The width of the 
the surrogate level set e is related to the amount of data that will be removed. If e is too large, the method 
will not remove very many examples and will be less helpful for reducing computation (see Figure 3). 

In the hrst stage of data reduction, we solve the surrogate problem to: (i) compute the upper bound 
on the objective value of classihers in the surrogate level set Z{f*) -|-e; and (ii) to identify a baseline label 
jji := sign for each example i = 1,..., A". In the second stage of data reduction, we solve a variant 

of the surrogate problem for each example i = I,..., N. The variant of the surrogate problem includes 
an additional constraint that forces example i to be classihed as — 

inin Z{f-Df^) s.t f G F and yif{xi) < 0 (9) 

We denote the optimal classiher to the variant as /*. If /* lies outside of the surrogate level set (i.e., 
> Z{f*) -he) then no classiher in the surrogate level set will label point i as —pi. In other words, 
all classihers in the surrogate level set must label this point as iji. Since the surrogate level set contains 
the optimal classihers to the original problem by the assumption in (8), we can therefore remove example i 
from the reduced dataset Dm because we know that an optimal classiher to the original problem will label 
this point as yi. We illustrate this situation in Figure 2. 

In Theorem 5, we prove that we obtain the same set of optimal classihers if we train a model with the 
initial data Df^ or the reduced data Dm. In Theorem 6, we provide sufficient conditions for a surrogate loss 
function to satisfy the level set condition in (8). 
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Fig. 2 : We initialize data reduction with e large enough so that Z(f*) < Z{f*) + e for all f* G and all 
f* G . Here, f* is the optimal classifer to the original problem from the set of optimal classifiers , and 
f* is the optimal classifier to the surrogate problem from the set of optimal classifiers f*. Data reduction fits 
a classifier /_* for each example in the initial training data Dn by solving a variant of the surrogate problem 
with an additional constraint that forces /_* to classify z in a different way than /*. If Z{f*^) > Z{f*) +£, then 
we know the predicted class of example i under /* and can remove it from the reduced training data T)m‘ 


Algorithm 1 Data Reduction from T>]\f to T>m 

Input: initial training data, T>n = 

Input: surrogate problem, min Z(/;Djv) s.t. f G 
Input: width of the surrogate level set, e 

Vm < — 0 

/* — argmin^ Z{f]Vjs[) 

for i = 1,..., A do 



f*i <— argminZ(/;Div) s.t. f G T and yif{xi) < 0 
if Z(/*;Div) < +£ then 

Vm < — Dm U {xi,yi) 

end if 
end for 

Output: Vm, reduced training data 


Theorem 5 (Equivalence of the Reduced Data) Consider an optimization problem to train a classifier f G 
T with data Djv? 

min Z{f-,T>N) s.t f G T, 

as well as a surrogate optimization problem to train a classifier f G T with data Vpf, 

nnn Z(/;Djv) s.i. f G T. 

Let f* G T* := argmin^^gyr Z(/; Djv) o/nd f G J-* := argmin^g^ Z(/; Djy)- Z/tue choose a value of e so that 

Z{f-,VN) < Z{f*-VN) + e V/* G and f* G T*, (10) 

then Algorithm 1 will output a reduced dataset T>m LIn such that 

argminZ{/;©7v) = argmin Z{/; I?m)- 


( 11 ) 
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Proof See Appendix A. 

Theorem 6 (Sufficient Conditions to Satisfy the Level Set Condition) Consider an optimization prob¬ 
lem where the objective minimizes the 0-1 loss function Zqi : —>■ R, 

min Zqi (A), 

A6R-P 

as well as a surrogate optimization problem where the objective minimizes a surrogate loss function ip : R^ —^ R, 

min Z.j, (A). 
agk^ 

If the surrogate loss function ip satisfies the following properties for all A € R^, Aqj^ G argmin_)^gjjp Zqi (A), and 
A^ G argmin;^gjjP Z^ (A); 

I. Upper bound on the 0-1 loss: Zqi (A) < Z^{\) 

II. Lipschitz near Ajj^; ||A — A^|| < A Z^ (A) — Z^ (A^) < L||A — Aj|| 

III. Curvature near A^; ||A — A^|| > C\ Z^ (A) — Z^ (■^^) > C'i/> 

IV. Closeness of loss near Kc \Z^{Ki)-Zoi (A5i)l <£ 

then it will also satisfy a level-set condition required for data reduction, 

Zt^ (-^oi) < Z^ (■^^) ^ ''^■^01 A^, 

whenever e = LC\ obeys C^ > 2e. 

Proof See Appendix A. 


4.2 Off-The-Shelf Data Reduction for SLIM 

Data reduction can easily be applied to SLIM by using an off-the-shelf approach where we use the LP 
relaxation of the SLIM IP as the surrogate problem. The off-the-shelf approach may be used as a preliminary 
procedure before the training process, or as an iterative procedure that is called by the IP solver during 
the training process as feasible solutions are found. 

When we use the LP relaxation to the SLIM IP as the surrogate problem, we can determine a suitable 
width for the surrogate level set e by using a feasible solution to the SLIM IP. To see this, let us denote 
the SLIM IP as miny- Z{f) s.t. f £ T, and denote its LP relaxation as min^: Z{f) s.t. / G .T. In addition, let 
us denote the optimal solution to the SLIM IP as /* and the optimal solution to the LP relaxation as /*. 
Since IF C T, we have that Z{f*) < Z{f*). For any feasible solution to the SLIM IP / G we also have 
that Z{f*) < Z{f). Combining both inequalities, we see that, 

Z{f*) < Z{f) < Z{f). 

Thus, we can satisfy the level set condition (8) using a feasible solution to the SLIM IP / G by setting 
the width of the surrogate level set as 

e(/) := Z{f) - Z{f*). 

In Figure 3, we show much training data can be discarded using off-the-shelf data reduction when we 
train a SLIM scoring system on the bankruptcy dataset (see Table 4). Specifically, we plot the percentage 
of data removed by Algorithm 1 for values of e G [emin,£max] where Emin and £max represent the smallest 
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and largest widths of the surrogate level set that could be used in practice. In particular, Smin is computed 
using the optimal solution to the IP as: 


£min := Z{f) - Z{f*), 

and £max is computed using a feasible solution to the IP that can be guessed without any computation (i.e., 
a linear classiher with coefficients A = 0): 


emax:=^(0)-Z(/*). 

In this case, we can discard over 40% of the training data by using the trivial solution A = 0, and discard 
over 80% of the training data by using a higher quality feasible solution. 



1 2 3 4 5 


e 


Fig. 3: Proportion of training data filtered as a function of the width of the level set, e for the bankruptcy 
dataset. Here, the original problem is an instance of the SLIM IP with Co = 0.01 and C = {—10,..., lO}^”*"^. 


5 Application to Sleep Apnea Screening 

In this section, we discuss a collaboration with the MGH Sleep Laboratory where we used SLIM to create a 
scoring system for sleep apnea screening (see also Ustun et ah, 2015, for a far more detailed treatment). Our 
goal is to highlight the flexibility and performance of our approach on a real-world problem that requires a 
tailored prediction model. 


5.1 Data and Operational Constraints 

The dataset for this application contains N = 1922 records of patients and P = 33 binary features related 
to their health and sleep habits. Here, yi = -|-1 if patient i has obstructive sleep apnea (OSA) and yi = —1 
otherwise. There is signihcant class imbalance as Pr(t/i = -|-1) = 76.9%. 

To ensure that the scoring system we produced would be used and accepted by physicians, our collab¬ 
orators specified three simple operational constraints: 

1. Limited FPR: The model had to achieve the highest possible true positive rate (TPR) while maintaining 
a maximum false positive rate (FPR) of 20%. This would ensure that the model could diagnose as many 
cases of sleep apnea as possible but limit the number of faulty diagnoses. 
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2. Limited Model Size: The model had to be transparent and use at most 5 features. This would ensure that 
the model was could be explained and understood by other physicians in a short period of time. 

3. Sign Constraints: The model had to obey established relationships between well-known risk factors and 
the incidence of sleep apnea (e.g. it could not suggest that a patient with hypertension had a lower risk 
of sleep apnea since hypertension is a positive risk factor for sleep apnea). 


5.2 Training Setup and Model Selection 

We trained a SLIM scoring system with integer coefficients between —10 and 10. We addressed all three 
operational constraints without parameter tuning or model selection, as follows: 

• We added a loss constraint using the loss variables to limit the maximum FPR at 20%. We then set 

= N~/{I + N~) to guarantee that the optimization would yield a classiher with the highest possible 
TPR with a maximum FPR less than 20% (see Section 2.2). 

• We added a feature-based constraint using the loss variables to limit the maximum number of features 
to 5 (see Section 2.2). We then set Co = 0.9W~/NP so that the optimization would yield a classifier 
that did not sacrifice accuracy for sparsity (see Remark 3). 

• We added sign constraints to the coefficients to ensure that our model would not violate established 
relationships between features and the predicted outcome (i.e., we set \j > 0 if there had to be a positive 
relationship, and Xj < 0 if there had to be a negative relationship). 

With this setup, we trained 10 models with subsets of the data to assess predictive accuracy via 10-fold cross 
validation (10-CV), and 1 final model with all of data to hand over to our collaborators. We set up each 
IP using the slim_for_matlab toolbox (Ustun, 2015) and solved each IP for 1 hour, in parallel, on 12-core 
2.7GHZ machine with 48GB RAM. Thus, the training process for SLIM required 1 hour of computing time. 

As a comparison, we trained models with 8 baseline classihcation methods shown in Table 1. We dealt 
with the class imbalance by using a cost-sensitive approach, where we used a weighted loss function and 
varied its sensitivity parameter across a large range. When possible, we addressed the remaining 
operational constraints by searching over a hne grid of free parameters. Model selection was difficult for 
baseline methods because they could not accomodate operational constraints in the same way as SLIM. For 
each baseline method, we chose the best model that satished all operational constraints by: (i) dropping 
any instance of the free parameters where operational constraints were violated; (ii) choosing the instance 
that maximized the 10-CV mean test TPR. We ruled that an instance of the free parameters violated an 
operational constraint if any of the following conditions were met: (1) the 10-CV mean test FPR of the 
model produced with the instance was greater than the 10-CV mean test FPR of the SLIM model (20.9%); 
(2) the model size^ of the final model produced with the instance was greater than 5; (3) the hnal model 
produced did not obey sign constraints. This model selection procedure may have biased the results in favor 
of the baseline methods because we mixed testing and training data by looking at the final model to ensure 
that operational constraints were satished. 


5.3 Results and Observations 

In what follows, we report our observations related to operational constraints, predictive performance and 
interpretability. We show the performance of the best model we trained using each method in Table 2, and 
summarize the operational constraints they were able to satisfy in Table 3. 

^ Model size represents the number of coefficients for linear models (Lasso, Ridge, Elastic Net, SLIM, SVM Lin.), the 
number of leaves for decision tree models (C5.0T, CART), and the number of rules for rule-based models (C5.0R). For 
completeness, we set the model size for black-box models (SVM RBF) to the number of features in each dataset. 
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Method 

Controls 

^ Instances 

Settings and Free Parameters 

CART 

Max FPR 
Model Size 

39 

39 values of 1V+ G {0.025, 0.05, . . . , 0.975} 

C5.0T 

Max FPR 

39 

39 values of W’" G {0.025, 0.05, . . . , 0.975} 

C5.0R 

Max FPR 
Model Size 

39 

39 values of IV+ G {0.025, 0.05, . . . , 0.975} 

Lasso 

Max FPR 
Model Size 
Signs 

39000 

39 values of W+ G {0.025, 0.05, . . . , 0.975} 

X 1000 values of A chosen by glmnet 

Ridge 

Max FPR 

39000 

39 values of 1V+ G {0.025, 0.05, . . . , 0.975} 

Signs 

X 1000 values of A chosen by glmnet 


Max FPR 


39 values of IV+ G {0.025, 0.05, . . . , 0.975} 

Elastic Net 

Model Size 

975000 

X 1000 values of A chosen by glmnet 


Signs 


X 19 values of a G {0.05, 0.10, . . . , 0.95} 

SVM Lin. 

Max FPR 

975 

39 values of 1V+ G {0.025, 0.05, . . . , 0.975} 

X 25 values of C G {10“^, 10“^ . . . , 10^} 

SVM RBF 

Max FPR 

975 

39 values of 1V+ G {0.025, 0.05, . . . , 0.975} 

X 25 values of C G {10“^, . . . , 10^} 

SLIM 

Max FPR 
Model Size 
Signs 

1 

W+ = JV~/(1 + JV~), Co = 0.9W/AfP, 

Ao G {-100, . . . , 100}, Xj G {-10, . . . , 10} 


Table 1: Training setup for all methods. An instance is a unique combination of free parameters. Controls 
refer to operational constraints that we expect each method to handle. We include further details on methods 
and software packages in Table 5. 


On the Difficulties of Handling Operational Constraints 

Among the 9 classification methods that we used, only SLIM, Lasso and Elastic Net could produce a model 
that satisfied all of operational constraints given to us by physicians. Tree and rule-based methods such 
as CART, C5.0 Tree and C5.0 Rule were unable to produce a model with a maximum FPR of 20% (see 
Figure 4). Methods that used ^ 2 -regularization such as Ridge, SVM Lin. and SVM RBF were unable to 
produce a model with the required level of sparsity. While we did not expect all methods to satisfy all 
of the operational constraints, we included them to emphasize the following important points. Namely, 
state-of-the-art methods for applied predictive modeling do not: 

• Handle simple operational constraints that are crucial for models to be used and accepted. Implemen¬ 
tations of popular classihcation methods do not have a mechanism to adjust important model qualities. 
That is, there is no mechanism to control sparsity in C5.0T (Kuhn et al. 2012) and no mechanism to 
incorporate sign constraints in SVM (Meyer et al. 2012). Finding a method with suitable controls is 
especially difficult when a model has to satisfy multiple operational constraints. 

• Have controls that are easy-to-use and/or that work correctly. When a method has suitable controls to 
handle operational constraints, producing a model often requires a tuning process over a high-dimensional 
free parameter grid. Even after extensive tuning, however, it is possible to never find a model that satisfies 
all operational constraints (e.g. CART, C5.0R, C5.0T for the Max FPR constraint in Figure 4). 

• Allow tuning to be portable when the training set changes. Consider a standard model selection procedure 
where we choose free parameters to maximize predictive accuracy. In this case, we would train models 
on several folds for each instance of the free parameters, choose an instance of the free parameters that 
maximized our estimate of predictive accuracy among the instances that met all operational constraints, 
and then train a final model using these values of the free parameters. Unfortunately, there is no guarantee 
that the hnal model will obey all operational constraints. 














Supersparse Linear Integer Models for Optimized Medical Scoring Systems 


17 




OBJECTIVE 

CONSTRAINTS 


OTHER INFORMATION 


Method 

Constraints 

Satisfied 

Test 

TPR 

Test Final 

F^R 

Size 

Model 

Size 

Train Train 

TPR FPR 

Final 

Train 

TPR 

Final 

Train 

FPR 

SLIM 

All 

61.4% 

55.5 - 68 . 8 % 

20.9% 5 

15.0 - 30 . 4 % 

5 

5-5 

62.4% 19.7% 

61.0 - 64 . 2 % 19.3 - 20 . 0 % 

62.0% 

19.6% 

Lasso 

All 

29.3% 

8.6% 3 

3 

28.7% 7.2% 

22.1% 

3.8% 

19.2 - 60 . 0 % 

0.0 - 33 . 3 % 

3-6 

21.4 - 54 . 6 % 3.5 - 20 . 5 % 

- 

- 

Elastic Net 

All 

44.2% 

18.8% 3 

3 

45.6% 17.4% 

54.3% 

20.7% 

0.0 - 64 . 1 % 

0.0 - 37 . 0 % 

3-6 

0.0 - 66 . 5 % 0.0 - 36 . 4 % 

- 

- 

Ridge 

Max FPR 

66.0% 

20.6% 30 

30 

66.4% 18.9% 

66.0% 

18.9% 

60.5 - 68 . 5 % 

8.6 - 32 . 6 % 

30 - 30 

64.0 - 68 . 9 % 17.3 - 21 . 5 % 

- 


SVM RBF 

Max FPR 

64.3% 

59.2 - 71 . 1 % 

20.8% 33 

10.0 - 30 . 4 % 

33 

33 - 33 

67.9% 12.2% 

66.5 - 70 . 0 % 11.1 - 13 . 3 % 

67.8% 

12.4% 

SVM Lin. 

Max FPR 

62.7% 

57.9 - 69 . 0 % 

19.8% 31 

7.5 - 28 . 6 % 

31 

31 - 31 

63.7% 17.0% 

61.5 - 66 . 1 % 15.6 - 18 . 5 % 

63.1% 

17.1% 

C5.0R 

None 

84.0% 

78.9 - 87 . 7 % 

43.0% 26 

32.6 - 54 . 2 % 

23 

18 - 30 

86.1% 33.8% 

84.2 - 88 . 5 % 30.9 - 38 . 2 % 

85.5% 

32.9% 

C5.0T 

None 

81.3% 

77.4 - 84 . 8 % 

42.9% 39 

29.6 - 62 . 5 % 

42 

39 - 50 

85.3% 29.5% 

82.6 - 88 . 6 % 24.6 - 33 . 7 % 

84.5% 

28.4% 

CART 

None 

93.0% 

88.8 - 96 . 1 % 

70.4% 8 

61.1 - 83 . 3 % 

9 

4 - 12 

95.2% 66.8% 

93.1 - 97 . 2 % 55.0 - 76 . 0 % 

95.9% 

73.9% 

Table 

2: TPR, FPR and model size for all methods. We report the 

10-CV mean TPR and FPR, and the 

10-CV median for the model size. The ranges in each cell represent the 10-CV minimum and maximum. 





% of Instances that Satisfied 




Method 

Max FPR 

Max FPR Model 

Size Max 

FPR, Model Size Signs 



SLIM 

100.0% 

100.0% 


100.0% 




Lasso 

19.6% 

4.8% 


4.8% 




Elastic Net 

18.3% 

1.0% 


1.0% 




Ridge 

20.9% 

0.0% 


0.0% 




SVM Lin 

18.7% 

0.0% 


0.0% 




SVM RBF 

15.8% 

0.0% 


0.0% 




C5.0R 

0.0% 

0.0% 


0.0% 




C5.0T 

0.0% 

0.0% 


0.0% 




CART 

0.0% 

0.0% 


0.0% 




Table 3: Percentage of instances that fulfilled operational constraints. Each instance is a unique combination 
of free parameters for a given method. 


On the Sensitivity of Acceptable Models 

Among the three methods that produced acceptable models, the scoring system produced by SLIM had 
significantly higher sensitivity than the models produced by Lasso and Elastic Net - a result that we 
expected given that SLIM minimizes the 0-1 loss and an £o-penalty while Lasso and Elastic Net minimize 
convex surrogates of these quantities. This result held true even when we relaxed various operational 
constraints. In Figure 5, for instance, we plot the sensitivity and sparsity of models that satisfied the max 
FPR and sign constraints. Here, we see that Lasso and Elastic Net need at least 8 coefficients to produce 
a model with the same degree of sensitivity as SLIM. In Figure 6, we plot the TPR and FPR of models 
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Fig. 4: 10-CV mean test FPR for models trained with CART, C5.0, C5.0T across the full range of VF+. These 
methods cannot produce a model that satisfies the max FPR < 20% constraint. 



Elastic Net 

♦ Lasso 

♦ SLIM 


Fig. 5: Sensitivity and model size of Lasso and Elastic Net models that satisfy the sign and FPR constraints. 
For each method, we plot the instance that attains the highest 10-CV mean test TPR at model sizes between 0 
and 8. Lasso and Elastic Net need at least 8 coefficients to produce a model with the same sensitivity as SLIM. 


that satisfied the sign and model size constraints. As shown, SLIM scoring systems dominate Lasso and 
Elastic Net models across the entire ROC curve. These sensitivity advantages are also evident in Table 2: 
in particular, SLIM yields a model with a similar level of sensitivity and specificity as Ridge and SVM Lin. 
even as it is fitting models from a far smaller hypothesis space (i.e. linear classifiers with 5 features, sign 
constraints and integer coefficients vs. linear classifiers with real coefficients). 

On the Usability and Interpretability of Acceptable Models 

To discuss interpretability, we compare the best models that satisfied all operational constraints in Figure 
7, and present the SLIM model as a scoring system in Figure 8. 

In this case, our collaborators found that all three models were aligned with domain knowledge as they 
obeyed sign constraints and had large coefficients for well-known risk factors such as 6mi, female, age, 
snoring and/or hypertension. Unfortunately, the Lasso and Elastic Net models could not be deployed as 
screening tools due to their poor sensitivity (29.3% for Lasso and 44.2% for Elastic Net). This was not the 
case for the SLIM model, which had a much higher sensitivity (61.4%). 

Our results highlight some of the unique interpretability benefits of SLIM scoring systems - that is, their 
ability to provide “a qualitative understanding of the relationship between joint values of the input variables 
and the resulting predicted response value” (Hastie et ah, 2009). SLIM scoring systems are well-suited to 
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0% 20% 40% 60% 80% 100% 

False Positive Rate 


Fig. 6: ROC curve for SLIM, Lasso and Elastic Net instances that satisfy the sign and model size constraints. 
For each method, we plot the instance that attains the highest 10-CV mean test TPR for 10-CV mean FPR 
values of 5%, 10%,..., 95%. Note that we had to train 19 additional instances of SLIM to create this plot. 


SLIM 4 age >60 +4 hypertension + 2 bmi >30 +2 bmi >40 — 6 female — 1 

Lasso 0.13 snoring + 0.12 hypertension — 0.26 female — 0.17 

Elastic Net 0.03 snoring + 0.02 hypertension — 0.09 female — 0.02 

Fig. 7: Score functions of the most sensitive predictive models that satisfied all three operational constraints. 
The baseline models have very poor sensitivity as shown in Table 2. 


PREDICT PATIENT HAS OBSTRUCTIVE SLEEP APNEA IF SCORE > 1 


1 . 

age > 60 

4 points 


2. 

hypertension 

4 points 

+ . 

3. 

body mass index > 30 

2 points 

+ . 

4. 

body mass index > 40 

2 points 

+ . 

5. 

female 

-6 points 

+ . 


ADD POINTS FROM ROWS 1-5 

SCORE 

= . 


Fig. 8: SLIM scoring system for sleep apnea screening. This model achieves a 10-CV mean test TPR/FPR of 
61.4/20.9%, obeys all operational constraints, and was trained without parameter tuning. It also generalizes 
well due to the simplicity of the hypothesis space: here the training TPR/FPR of the final model is 62.0/19.6%. 


provide this kind of qualitative understanding due to their high level of sparsity and small integer coefficients. 
These qualities help users gauge the influence of each input variable with respect to the others, which is 
especially important because humans can only handle a few cognitive entities at once (7 ± 2 according 
to Miller 1984), and are seriously limited in estimating the association between three or more variables 
(Jennings et ah, 1982). Sparsity and small integer coefficients also allow users to make quick predictions 
without a computer or a calculator, which may help them understand how the model works by actively 
using it to classify prototypical examples. Here, this process helped our collaborators come up with the 
following simple rule-based explanation for our model predicted that a patient has OSA (i.e., when SCORE 
> 1): “ 2 / the patient is male, predict OSA if age > 60 OR hypertension OR bmi > 30; if the patient is female, 
predict OSA if bmi > 40 AND (age > 60 OR hypertension).'^ 















20 


Berk Ustun, Cynthia Rudin 


6 Numerical Experiments 

In this section, we present nnmerical experiments to compare the accuracy and sparsity of SLIM scoring 
systems to other popular classification models. Our goal is to illustrate the off-the-shelf performance of 
SLIM and show that we can train accurate scoring systems for real-sized datasets in minutes. 


6.1 Experimental Setup 

Datasets'. We ran numerical experiments on 8 datasets from the UCI Machine Learning Repository (Bache 
and Lichman, 2013) summarized in Table 4. We chose these datasets to explore the performance of each 
method as we varied the size and nature of the training data. We processed each dataset by binarizing 
all categorical features and some real-valued features. For the purposes of reproducibility, we include all 
processed datasets in Online Resource 1. 


Dataset 

Source 

N 

P 

Classification Task 

adult 

Kohavi (1996) 

32561 

36 

predict if a U.S. resident earns more than $50 000 

breastcancer 

Mangasarian et al. (1995) 

683 

9 

detect breast cancer using a biopsy 

bankruptcy 

Kim and Han (2003) 

250 

6 

predict if a firm will go bankrupt 

haberman 

Haberman (1976) 

306 

3 

predict 5-year survival after breast cancer surgery 

heart 

Detrano et al. (1989) 

303 

32 

identify patients a high risk of heart disease 

mammo 

Elter et al. (2007) 

961 

12 

detect breast cancer using a mammogram 

mushroom 

Schlimmer (1987) 

8124 

113 

predict if a mushroom is poisonous 

spambase 

Cranor and LaMacchia (1998) 

4601 

57 

predict if an e-mail is spam 


Table 4: Datasets used in the numerical experiments. 


Methods'. We summarize the training setup for each method in Table 5. We trained SLIM scoring systems 
using slim_for_matlab toolbox paired with the CPLEX 12.6.0 API and models with baseline methods using 
publicly available packages in R 3.1.1 (R Core Team, 2014). For each method, each dataset, and each unique 
combination of free parameters, we trained 10 models using subsets of the data to estimate predictive 
accuracy via 10-fold cross-validation (10-CV), and 1 final model using all of the data to assess sparsity and 
interpretability. We ran all baseline methods without time constraints over a large grid of free parameters. 
We produced an £o-regularization path for SLIM by solving 6x11 IPs for each dataset (6 values of Co x 11 
training runs per Co). We allocated at most 10 minutes to solve each IP, and solved 12 IPs in parallel on a 
12-core 2.7 GHZ machine with 48 GB RAM. Thus, it took at most 1 hour to train SLIM scoring systems 
for each dataset. Since the adult and haberman datasets were imbalanced, we trained all methods on 
these datasets with a weighted loss function where we set = N~ /N and W~ = /N. 


6.2 Results and Observations 

We summarize the results of our experiments in Table 6 and Figures 13-14. We report the sparsity of 
models using a metric that we call model size. Model size represents the number of coefficients for linear 
models (Lasso, Ridge, Elastic Net, SLIM, SVM Lin.), the number of leaves for decision tree models {G5.0T, 
GART), and the number of rules for rule-based models (G5.0R). For completeness, we set the model size 
for black-box models (SVM RBF) to the number of features in each dataset. 
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Method 

Acronym 

Software 

Settings and Free Parameters 

CART Decision Trees 

CART 

rpart (Therneau et al., 2012) 

default settings 

C5.0 Decision Trees 

C5.0T 

c50 (Kuhn et al., 2012) 

default settings 

C5.0 Rule List 

C5.0R 

c50 (Kuhn et al., 2012) 

default settings 

Log. Reg. -|- penalty 

Lasso 

glnmet (Friedman et al., 2010) 1000 values of A chosen by glmnet 

Log. Reg. -|- £2 penalty 

Ridge 

glnmet (Friedman et al., 2010) 1000 values of A chosen by glmnet 

Log. Reg. -|- £iI £2 penalty Elastic Net glnmet (Friedman et al., 2010) 

1000 values of A chosen by glmnet 

X 19 values of a G {0.05, 0.10, . . . , 0.95} 

SVM -h Linear Kernel 

SVM Lin. 

el071 (Meyer et al.. 2012) 

25 values of C G {10“^, lO”^ ’’'^, .... 10^} 

SVM + RBF Kernel 

SVM RBF 

el071 (Meyer et al.. 2012) 

25 values of C G {10“^, lO”^'’’®, .... 10^} 

SLIM Scoring Systems 

SLIM 

slim_for_matlab (Ustun, 2015) 

6 values of Co G {0.01,0.075. 0.05, 0.025, 0.001, 0.9/NP} 
with \j G {-10, . . . , 10}; Aq G {-100, . . . , 100} 


Table 5: Training setup for classification methods used for the numerical experiments. 


We show the accuracy and sparsity of all methods on all dataset in Figures 13-14. For each dataset, and 
each method, we plot a point at the 10-CV mean test error and final model size, and surround this point 
with an error bar whose height corresponds to the 10-CV standard deviation in test error. In addition, we 
include ^o-regularization paths for SLIM and Lasso on the right side of Figures 13—14 to show how the test 
error varies at different levels of sparsity for sparse linear models. 


Dataset 

Details 


Metric 

SLIM 

Lasso 

R-idge 

Elastic Net 

C5.0R 

C5.0T 

CART 

SVM Lin. 

SVM RBF 


N 

32561 test error 

17.4 ± 1,4% 

17.3 ± 0.9% 

17.6 ± 0.9% 

17.4 ± 0.9% 

26.4 ± 1.8% 26.3 ± 1.4% 75,9 ± 0.0% 

16.8 ± 0.8% 

16.3 ± 0.5% 

adult 

P 

37 

train error 

17.5 ± 1.2% 

17,2 ± 0.1% 

17.6 ± 0.1% 

17.4 ± 0.1% 

25.3 ± 0,4% 24.9 ± 0.4% 75.9 ± 0.0% 

16.7 ± 0,1% 

16.3 ± 0.1% 

Pr(y=+1) 

24% 

model size 

18 

14 

36 

17 

41 

87 

4 

36 

36 


Pr(y=-1) 76% 

model range 

7 - 26 

13 - 14 

36 - 36 

16 - 18 

38 - 46 

78 - 99 

4 - 4 

36 - 36 

36 - 36 


N 

683 

test error 

3.4 ± 2.0% 

3.4 ± 2.2% 

3.4 ± 2.0% 

3,1 ± 2.1% 

4.3 ± 3.3% 

5.3 ± 3.4% 

5.6 ± 1.9% 

3.1 ± 2.0% 

3.5 ± 2.5% 

breastcancer 

P 

10 

train error 

3.2 ± 0.2% 

2.9 ± 0.3% 

3.0 ± 0.3% 

2,8 ± 0.3% 

2.1 ± 0.3% 

1.6 ± 0.4% 

3.6 ± 0.3% 

2.7 ± 0.2% 

0.3 ± 0,1% 

Pr(y=+1) 

35% 

model size 

2 

9 

9 

9 

8 

13 

7 

9 

9 


Pr(y=-1) 65% 

model range 

2 - 2 

8 - 9 

9 - 9 

9 - 9 

6 - 9 

7 - 16 

3 - 7 

9 - 9 

9 - 9 


N 

250 

test error 

0.8 ± 1.7% 

0.0 ± 0.0% 

0.4 ± 1.3% 

0,0 ± 0.0% 

0.8 ± 1.7% 

0.8 ± 1.7% 

1.6 ± 2.8% 

0.4 ± 1.3% 

0,4 ± 1.3% 

bankruptcy 

P 

7 

train error 

0.0 ± 0.0% 

0.0 ± 0.0% 

0.4 ± 0.1% 

0,4 ± 0.7% 

0.4 ± 0.2% 

0.4 ± 0.2% 

1.6 ± 0.3% 

0.4 ± 0.1% 

0,4 ± 0.1% 

Pr(y=+1) 

57% 

model size 

3 

3 

6 

3 

4 

4 

2 

6 

6 


Pr(y=-1) 43% 

model range 

2 - 3 

3 - 3 

6 - 6 

3 - 3 

4 - 4 

4-4 

2 - 2 

6 - 6 

6 - 6 


N 

306 

test error 

29.2 ± 14.0% 42.5 ± 11.3% 36.9 ± 15,0% 40,9 ± 14.0% 42.7 ± 9,4% 42.7 ± 9,4% 43.1 ± 8.0% 45.3 ± 14.7% 47.5 ± 6.2% 

haberman 

P 

4 

train error 

29.7 ± 1.5% 40.6 ± 1,9% 41.0 ± 9,7% 45,1 ± 12.0% 40.4 ± 8.5% 40.4 ± 8,5% 34,3 ± 2.8% 

46.0 ± 3.6% 

5.4 ± 1.5% 

Pr(!/=+l) 

74% 

model size 

3 

2 

3 

1 

3 

3 

9 

3 

4 


Pr(y=-1) 

26% 

model range 

2 - 3 

2 - 2 

3 - 3 

1 - 1 

0 - 3 

1 - 3 

4 - 9 

3 - 3 

4 - 4 


N 

961 

test error 

19.5 ± 3,0% 

19,0 ± 3.1% 

19.2 ± 3.0% 

19.0 ± 3.1% 

20.5 ± 3.3% 20.3 ± 3.5% 20.7 ± 3.9% 

20.3 ± 3.0% 

19.1 ± 3.1% 

mammo 

P 

15 

train error 

18.3 ± 0,3% 

19.3 ± 0.3% 

19.2 ± 0.4% 

19.2 ± 0.3% 

19.8 ± 0,3% 19.9 ± 0,3% 20,0 ± 0.6% 

20.3 ± 0.4% 

18.2 ± 0.4% 

Pr(y=+1) 

46% 

model size 

9 

13 

14 

14 

5 

5 

5 

14 

14 


Pr(!/=-l) 

54% 

model range 

9 - 11 

12 - 13 

14 - 14 

13 - 14 

3 - 5 

4 - 6 

3 - 5 

14 - 14 

14 - 14 


N 

303 

test error 

16.5 ± 7.8% 

15,2 ± 6.3% 

14.9 ± 5.9% 

14.5 ± 5.9% 

21.2 ± 7.5% 23.2 ± 6.8% 19.8 ± 6.5% 

15.5 ± 6.5% 

15.2 ± 6.0% 

heart 

P 

33 

train error 

14.9 ± 1.1% 

14.0 ± 1.0% 

13.1 ± 0.8% 

13.2 ± 0.6% 

10.0 ± 1,8% 

8.5 ± 2.0% 

14.3 ± 0.9% 

13.6 ± 0.5% 

10.4 ± 0.8% 

Pr(y=+1) 

46% 

model size 

3 

12 

32 

24 

7 

16 

6 

31 

32 


Pr(!/=-l) 

54% 

model range 

3 - 3 

10 - 13 

30 - 32 

22 - 27 

9 - 17 

12 - 27 

6 - 8 

28 - 32 

32 - 32 


N 

8124 

test error 

0.0 ± 0.0% 

0.0 ± 0.0% 

1.7 ± 0.3% 

0,0 ± 0.0% 

0.0 ± 0.0% 

0.0 ± 0.0% 

1.2 ± 0.6% 

0.0 ± 0.0% 

0.0 ± 0.0% 

mushroom 

P 

114 

train error 

0.0 ± 0.0% 

0.0 ± 0.0% 

1.7 ± 0.0% 

0,0 ± 0.0% 

0.0 ± 0.0% 

0.0 ± 0.0% 

1.1 ± 0.3% 

0.0 ± 0.0% 

0,0 ± 0.0% 

Pr(y=+1) 

48% 

model size 

7 

21 

113 

108 

8 

9 

7 

98 

113 


Pr(!/=-l) 

52% 

model range 

7 - 7 

19 - 23 

113 - 113 

106 - 108 

8 - 8 

9 - 9 

6 - 8 

98 - 108 

113 - 113 


N 

4601 

test error 

6.3 ± 1.2% 

10.0 ± 1.7% 

26.3 ± 1.7% 

10.0 ± 1.7% 

6.6 ± 1.3% 

7.3 ± 1.0% 

11,1 ± 1.4% 

7.8 ± 1.5% 

13.7 ± 1.4% 

spambase 

P 

58 

train error 

5.7 ± 0.3% 

9.5 ± 0.3% 

26.1 ± 0.2% 

9,6 ± 0.2% 

4.2 ± 0.3% 

3.9 ± 0.3% 

9.8 ± 0.3% 

8.1 ± 0.8% 

1,3 ± 0.1% 

Pr(y=+1) 

39% 

model size 

34 

28 

57 

28 

29 

73 

7 

57 

57 


Pr{v=-1) 

61% 

model range 

28 - 40 

28 - 29 

57 - 57 

28 - 29 

23 - 31 

56 - 78 

6 - 10 

57 - 57 

57 - 57 


Table 6: Accuracy and sparsity of all methods on all datasets. Here: test error refers to the 10-CV mean test 
error ± the 10-CV standard deviation in test error; train error refers to the 10-CV mean training error ± the 
10-CV standard deviation in training error; model size refers to the final model size; and model range refers to 
the 10-CV minimum and maximum model size. The results reflect the models produced by each method when 
free parameters are chosen to minimize the 10-CV mean test error. We report the 10-CV weighted test and 
training error for adult and haberman. 
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We wish to make the following observations regarding our results: 

On the Accuracy, Sparsity and Computation 

Our results show that many methods are unable to produce models that attain the same levels of accuracy 
and sparsity as SLIM. As shown in Figures 13-14, SLIM always produces a model that is more accurate 
than Lasso at some level of sparsity, and sometimes more accurate at all levels of sparsity (e.g., spambase, 
haberman, mushroom, breastcancer) . Although optimization problems to train SLIM scoring systems were 
AfP-hard, we did not find any evidence that computational issues hurt the performance of SLIM on any of 
the datasets. We obtained accurate and sparse models for all datasets in 10 minutes using CPLEX 12.6. 
Further, the solver provided a proof of optimality (i.e., a MIPGAP of 0.0%) for all models we trained for 
mammo, mushroom, bankruptcy, breastcancer. We attribute these benefits to SLIM’s tighter MIP formulation 
(see Section 2.1). 

On the Regularization Effect of Discrete Coefficients 

We expect that methods that directly optimize accuracy and sparsity will achieve the best possible accuracy 
at every level of sparsity (i.e. the best possible trade-off between accuracy and sparsity). SLIM directly 
optimizes accuracy and sparsity. However, it may not necessarily achieve the best possible accuracy at each 
level of sparsity because it restricts coefficients to a finite discrete set C. 

By comparing SLIM to Lasso, we can identify a baseline regularization effect due to this £. set restriction. 
In particular, we know that when Lasso’s performance dominates that of SLIM, it is very arguably due to 
the use of a small set of discrete coefficients. Our results show that this tends to happen mainly at large 
model sizes (see e.g., the regularization path for breastcancer, heart, mammo). This suggests that the £ set 
restriction has a more noticeable impact on accuracy at larger model sizes. 

One interesting effect of the C set restriction is that the most accurate SLIM scoring system may not use 
all of the features in the dataset. In our experiments, we always trained SLIM with Co = 0.9/NP to obtain 
a scoring system with the highest training accuracy among linear models with coefficients in A € £ (see 
Remark 3). In the bankruptcy dataset, for example, we find that this model only uses 3 out of 6 features. 
This is due to the £ set restriction: if the £ restriction were relaxed, then the method would use all features 
to improve its training accuracy (as is the case with Ridge or SVM Lin.). 

On the Interpretability of Models 

To discuss interpretability, we focus on the mushroom dataset, which provides a nice basis for comparison 
as many methods produce a model that attains perfect predictive accuracy. In Figures 9-12, we show the 
sparsest models that achieve perfect predictive accuracy. We omit models from some methods because they 
do not attain perfect accuracy (CART), or use far more features (Ridge, SVM Lin, SVM RBF). 

Here, the SLIM scoring system uses 7 integer coefficients. However, it can be simplified into a 5 line 
scoring system since odor=none, odor—almond, and odor=anise are mutually exclusive variables with the 
same coefficient. The model lets users make predictions by hand, and uses a linear form that helps users 
gauge the influence of each input variable with respect to the others. Note that only some of these qualities 
are found in the other models. The Lasso model, for instance, has a linear form but uses far more features. 
In contrast, the C5.0 models let users to make predictions by hand, but have a hierarchical structure that 
makes it difficult to gauge the influence of each input variable with respect to the others. 

We note that these qualities represent “baseline” interpretability benefits. In practice, interpretability is 
a subjective and multifaceted notion (i.e., it depends on who will be using the model, and on many model 
qualities, as discussed in Kodratoff (1994), Pazzani (2000), Freitas (2014)). In light of this, SLIM has a 
additional interpretability benefit because it allows practitioners to work closely with their target audience 
and encode all interpretability-related requirements into their model by means of operational constraints. 
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PREDICT MUSHROOM IS POISONOUS IF SCORE > 3 


1. 

spore-print ^color = green 

4 points 


2. 

stalk.sur f ace_abovejring = grooves 

2 points 

+ . 

3. 

population = clustered 

2 points 

+ . 

4. 

gill-size = broad 

-2 points 

+ . 

5. 

odor E {none, almond, anise} 

-4 points 

+ . 


ADD POINTS FROM ROWS 1-5 

SCORE 

= . 


Fig. 9: SLIM scoring system for mushroom. This model has a 10-CV mean test error of 0.0 di 0.0%. 



10.86 spore-print-color — green 

+ 

4.49 gill-size — narrow 

-\- 

4.29 odor — foul 

+ 

2.73 stalk_surface_below_ring — scaly 

+ 

2.60 stalk_surface_above_ring — grooves 

-\- 

2.38 population — clustered 

+ 

0.85 spore_print-Color — white 

+ 

0.44 stalk_root — bulbous 

+ 

0.43 gill-spacing — close 

+ 

0.38 cap-color — white 

+ 

0.01 stalk-color-below-ring — yellow 

— 

8.61 odor — anise 

— 

8.61 odor — almond 

— 

8.51 odor — none 

— 

0.53 cap-Surface — fibrous 

- 

0.25 population — solitary 

- 

0.21 stalk-surface-below-ring = fibrous 

- 

0.09 spore-print-color — brown 

- 

0.00 cap_shape — convex 

- 

0.00 gill-spacing — crowded 

- 

0.00 gill-size — broad 


+ 0.25 

Fig. 10: Lasso score function for mushroom. This model has a 10-CV mean test error of 0.0 di 0.0%. 
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Fig. 11: C5.0 decision tree for mushroom. This model has a 10-CV mean test error of O.OdiO.0%. 


Rule 


Confidence 

Support 

Lift 

odor — none A gilLsize 7^ narrow A sporejprint.color 7^ green ■■ 

=>■ safe 

1.000 

3216 

1.9 

bruises — false A odor — none A stalksurface-below-ring 7^ scaly 

=>■ safe 

0.999 

1440 

1.9 

odor — almond 

=>■ safe 

0.998 

400 

1.9 

odor — anise ■■ 

=>■ safe 

0.998 

400 

1.9 

odor 7^ almond A odor 7^ anise A odor 7^ none 

=>■ poisonous 

1.000 

3796 

2.1 

spore-print-color — green ■ 

=>■ poisonous 

0.986 

72 

2.9 

gilLsize — narrow A stalksurface.below.ring — scaly 

=>■ poisonous 

0.976 

40 

2.0 


Fig. 12: C5.0 rule list for mushroom. This model has a 10-CV mean test error of 0.0 di 0.0%. 
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Fig. 13: Accuracy and sparsity of all classification methods on all datasets. For each dataset, we plot the 
performance of models when free parameters are set to values that minimize the 10-CV mean test error (left), 
and plot the performance of SLIM and Lasso across the full ^o-regularization path (right). 
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Fig. 14: Accuracy and sparsity of all classification methods on all datasets. For each dataset, we plot the 
performance of models when free parameters are set to values that minimize the 10-CV mean test error (left)m 
and plot the performance of SLIM and Lasso across the full ^o-regularization path (right). 
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7 Specialized Models 

In this section, we present three specialized models related to SLIM. These models are all special instances 
of the optimization problem in (1). 


7.1 Personalized Models 


A Personalized Integer Linear Model (FILM) is a generalization of SLIM that provides soft control over the 
coefficients in a scoring system. To use this model, users define i? + 1 interpretability sets, 

Cr = {L,i, • • •, lr,Kr} for r = 0,..., i?, 

as well as a “personalized” interpretability penalty, 

( Cq if Xj € Cq 


= 


< Cn if A, G Ca¬ 


in order to penalize coefficients from less interpretable sets more heavily, we need that: (i) £i,... ,Ca are 
mutually exclusive; (ii) Cr is more interpretable than Cr+i\ (hi) the trade-off parameters are monotonically 
increasing in r, so that Co < ... < Ca- The values of the parameters Cr can be set as the minimum gain in 
training accuracy required for the optimal classiher to use a coefficient from Cr- 
As an example, consider training a FILM scoring system with the penalty: 

' Co = 0.00 if \j G 0 

= S C”! = 0.01 if Aj G ±{1,...,10} 

C2 = 0.05 if Aj G ±{11,...,100}. 

Here, the optimal classifier will use a coefficient from Ci if it yields at least a 1% gain in training accuracy, 

and a coefficient from C2 if it yields at least a 5% gain in training accuracy. 

We can train a FILM scoring system by solving the following IP: 


min 


s.t. 


N P 


i=i 


Mi'tpi 

> 

7 ViXjXjj 

i = 

= 1,.. 

..,N 

0-1 loss 
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= 
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1 
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Here, the loss constraints and Big-M parameters in (12a) are identical to those from the SLIM IP formulation 
in Section 2. The are binary indicator variables that are set to 1 if \j is equal to Constraints (12d) 
ensure that each coefficient uses exactly one value from one interpretability set. Constraints {12c) ensure 
that each coefficient \j is assigned a value from the appropriate interpretability set Cr- Constraints (12b) 
ensure that each coefficient \j is assigned the value specified by the personalized interpretability penalty. 


7.2 Rule-Based Models 

SLIM can be extended to produce specialized “rule-based” models when the training data are composed of 
binary rules. In general, any real-valued feature can be converted into a binary rule by setting a threshold 
(e.g., we can convert age into the feature age > 25 := 1 [age > 25]). The values of the thresholds can be set 
using domain expertise, rule mining, or discretization techniques (Liu et ah, 2002). 

In what follows, we assume that we train models with a binarized dataset that contains Tj binary rules 
hj^t {0,1}^ for each feature Xj € in the original dataset. Thus, we consider models with the form: 

We make the following assumptions about the binarization process. If Xj is a binary variable, then it is left 
unchanged so that Tj = 1 and hj Xj '■= Xj. If Xj is a categorical variable Xj € {1,..., K}, the binarization 
yields a binary rule for each category so that Tj = K and hj^t ■= 1 [xj = k] for t = 1,... ,K. Xj is a real 
variable, then the binarization yields Tj binary rules^ of the form hj f := 1 [xj > Vj^t] where Vj t denotes the 
threshold for feature j. 


/ P T, 

y = sign Ao + EE 

V j=i‘=1 


7.2.1 M-of-N Rule Tables 

M-of-N rule tables are simple rule-based models that, given a set of N rules, predict y = -|-I if at least M of 
them are true (see e.g.. Figure 15). These models have the major benefit that they do not require the user 
to compute a mathematical expression. M-of-N rule tables were originally proposed as auxiliary models 
that could be extracted from neural nets (Towell and Shavlik, 1993) but can also be trained as stand-alone 
discrete linear classification models as suggested by Chevaleyre et al. (2013). 

We can produce a fully optimized M-of-N rule table by solving an optimization problem of the form: 

N 

E ^ bnyi < 0] + C'o II-^IIq 

i=l 

s.t. Aq g {~p, ■ ■ ■ 5 0} 

t=l,...,Tj. 

The coefficients from this optimization problem yield an M-of-N rule table with M = Aq + 1 and N = 
12^=1 ^j,t- Here, we can achieve exact ffi-regularization using an ^i-penalty since ||Aj',t||g = for 

Xj^t £ {0,1}. Since we use the 0-1 loss, the trade-off parameter Co can be set as minimum gain in training 
accuracy required to include a rule in the optimal table. 

® While there exists an infinite number of thresholds for a real-valued feature, we only need consider at most N — 1 
thresholds (i.e. one threshold placed each pair of adjacent values, < Uj,t < Using additional thresholds will 

produce the same set of binary rules and the same rule-based model. 
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PREDICT TUMOR IS BENIGN 
IF AT LEAST 5 OF THE FOLLOWING 8 RULES ARE TRUE 


UniformityOfCellSize > 3 
UniformityOf Cell Shape > 3 
Marginal Adhesion > 3 
SingleEpithelialCellSize > 3 
BareNuclei > 3 
Normal Nucleoli > 3 
BlandChromatin > 3 
Mitoses > 3 

Fig. 15: M-of-N rule table for the breastcancer dataset for Co = O.Q/AP. This model has 8 rules and a 
10-CV mean test error of 4.8 di 2.5%. We trained this model with binary rules hij := 1 [xij > 3]. 


We can train an M-of-N rule table by solving the following IP: 

^ N P 

min -r; 'ibi + 

i=i i=i 
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G {-P,...,0} 






intercept value 


^j,t 

G {0,1} 
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..,P t 

= 1,. 
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G {0,1} 
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..,N 
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..,P t 

= 1,. 

-P 

int. penalty values 



Here, the loss constraints and Big-M parameters in (13a) are identical to those from the SLIM IP formulation 
in Section 2. Constraints (13b) define the penalty variables as the value of the £o-penalty. 

7.2.2 Threshold-Rule Models 

A Threshold-Rule Integer Linear Model (TILM) is a scoring system where the input variables are thresholded 
versions of the original feature set (i.e. decision stumps). These models are well-suited to problems where 
the outcome has a non-linear relationship with real-valued features. As an example, consider the SAPS 
II scoring system of Le Gall et al. (1993), which assesses the mortality of patients in intensive care using 
thresholds on real-valued features such as bloodjpressure > 200 and heartjrate < 40. TILM optimizes the 
binarization of real-valued features by using feature selection on a large (potentially exhaustive) pool of 
binary rules for each real-valued feature. Carrizosa et al. (2010), Van Belle et al. (2013) and Goh and Rudin 
(2014) take different but related approaches for constructing classihers with binary threshold rules. 

We train TILM scoring systems using an optimization problem of the form: 

I ^ 

min — 1 [yiVi < 0] -|- C/ • Features -|- Ct ■ Rules per Feature -|- t ||A||j^ 

i=l 

s.t. A G £, 

Tj 

N 1 ['''14 7^ 0] < Rmax for j = I, . . . , P, 
t=l 

sign (Aju) = ... = sign (AyrJ for j = 1,..., P. 
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TILM uses an interpretability penalty that penalizes the number of rules used in the classiher as well as the 
number of features associated with these rules. The small ^i-penalty in the objective restricts coefficients to 
coprime values as in SLIM. Here, Cf tunes the number of features used in the model, Ct tunes the number 
of rules per feature, and e is set to a small value to produce coprime coefficients. TILM includes additional 
hard constraints to limit the number of rules per feature to Rmax (e.g., Rmax = 3), and to ensure that the 
coefficients for binary rules from a single feature agree in sign (this ensures that each feature maintains a 
strictly monotonically increasing or decreasing relationship with the outcome). 

We can train a TILM scoring system by solving the following IP: 
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Here, the loss constraints and Big-M parameters in {14a) are identical to those from the SLIM IP formulation 
in Section 2. Constraints (14b) set the interpretability penalty for each coefficient as $j = CfVj + CtTj + 
The variables in the interpretability penalty include: zzj, which indicate that we use at least one 
threshold rule from feature j; tj, which count the number of additional binary rules derived from feature 
j; and Pj^t ■= The values of Uj and tj are set using the indicator variables aj t := 1 [\j^t 7^ 0] in 

constraints {14c) and (14d). Constraints (14e) limit the number of binary rules from feature j to Mmax- 
Constraints (14f) ensure that the coefficients of binary rules derived from feature j agree in sign; these 
constraints are encoded using the variables 5j := 1 [Xj^t > 0]- 
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8 Conclusion 

In this paper, we introduced a new method for creating data-driven medical scoring systems which we refer 
to as a Supersparse Linear Integer Model (SLIM). We showed how SLIM can produce scoring systems that 
are fully optimized for accuracy and sparsity, that can accomodate multiple operational constraints, and 
that can be trained without parameter tuning. 

The major benefits of our approach over existing methods come from the fact that we avoid approxi¬ 
mations that are designed to achieve faster computation. Approximations such as surrogate loss functions 
and -regularization hinder the accuracy and sparsity of models as well as the ability of practitioners to 
control these qualities. Such approximations are no longer needed for many datasets, since using current 
integer programming software, we can now train scoring systems for many real-world problems. Integer 
programming software also caters to practitioners in other ways, by allowing them to choose from a pool of 
models by mining feasible solutions and to seamlessly benefit from periodic computational improvements 
without revising their code. 
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A Proofs of Theorems 

Proof of Theorem 1 (Minimum Margin Resolution Bound) 

Proof We use normalized versions of the vectors, p/ ||p ||2 and A/yl because the 0—1 loss is scale invariant: 
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We set A > as in (5). Using A, we then define X/A element-wise so that XjfA is equal to pj/ \\p \\2 rounded 

to the nearest 1/yl for j = 1,..., P. 

We first show that our choice of A and A ensures that the difference between the margin of p/ \\p \\2 and the margin 
of A/yl on all training examples is always less than the minimum margin of p/ ||p|| 2 ) defined as 7min = min* This 

statement follows from the fact that, for all i\ 
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Here; the inequality in (15) uses the Cauchy-Schwarz inequality; the inequality in (16) is due to the fact that the distance 
between pj/ \\p \\2 and Xj/A is at most 1/2A] and the inequality in (17) is due to our choice of yl. 

Next, we show that our choice of yl and A ensures that p/ ||p ||2 and X/A classify each point in the same way. We 
consider three cases; first, the case where Xi lies on the margin; second, the case where p has a positive margin on and 
third, the case where p has a negative margin on Xi. For the case when Xi lies on the margin, min^ \p'^Xi\ = 0 and the 
theorem holds trivially. For the case where p has positive margin, p^xi > 0, the following calculation using (18) is relevant; 
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Thus, we have shown that xi > 0 whenever p^xi > 0. 

For the case where p has a negative margin on ajj, p^Xi < 0, we perform an analogous calculation: 
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and then using that p^Xi < 0, 


\p'^Xi\ . \p'^Xi\ -p'^Xi . \p'^Xi\ X^x 


Thus, we have shown X^Xi < 0 whenever p^xi < 0. 

Putting both the positive margin and negative margin cases together, we find that for all i, 

1 \yiP^Xi < oj = 1 Xi < oj . 

Summing over i yields the statement of the theorem. □ 


Proof of Theorem 3 (Generalization of Sparse Discrete Linear Classifiers) 


Proof Let Z{A;Djv) = ^ E^Ii 1 [v.X'^ < O] + Co ll-^llo ■ Note that A = 0 is a feasible solution since we assume that 

0 G C. Since A = 0 achieves an objective value of Z(0]T)n) = 1, any optimal solution, A E argmin;^^£ Z(X;T)f^), must 
attain an objective value Z(X]T)n) < 1. This implies 


Z{X;Vn) < 1, 
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Co l|A||o < ^ ^ 1 [y^X'^Xi < o] + Co ||A||o < 1, 
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The last line uses that ||A||o is an integer. 

Thus, 'Hp,c’o large enough to contain all minimizers of Z[-\T>]S[) for any T>^. The statement of the theorem follows 
from applying Theorem 2. □ 


Proof of Theorem 5 (Eqnivalence of the Reduced Data) 

Proof Let us denote the set of classifiers whose objective value is less or equal to Z(f*] T)^) as 

= {/ e I Z(/; Vm) < z(f-, Vm) + e] . 

In addition, let us denote the set of points that have been removed by the data reduction algorithm 
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By definition, data reduction only removes an example if its sign is fixed. This means that sign {f{xi)) = sign (^f{xi )) for 
alH E tS and / E Thus, we can see that for all classifiers / E 
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We now proceed to prove the statement in (11). When <5 = 0, then 'Dn = and (11) follows trivially. When, <5 7 ^ 0, 
we note that 

T* = argminZ(/; Div) = argmin Z(/; Vm U 5), 

/e.F 

= argminZ(/;X>M) + Z{f-,S), 

/eJF 

= argminZ(/;X>M) + C', ( 20 ) 

/eJF 

= argminZ(/;X>M)- 

/ 6 ^ 

Here, the statement in (20) follows directly from (19). □ 

Proof of Theorem 6 (Sufficient Conditions to Satisfy the Level Set Condition) 

Proof We assume that we have found a surrogate function, ijj, that satisfies conditions I-IV and choose C .0 > 2e. 

Our proof uses the following result: if ||Aqj — A^|| > Cx then Aq^^ cannot be a minimizer of Zqi (A) because this would 
lead to a contradiction with the definition of Aq^. To see that this result holds, we use condition III with A = Aq^ to see 
that ||Aq^ - A^ll > Cx implies (Ag^) - Thus, 
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Here the inequality in (21) follows from condition IV, the inequality in (22) follows from condition I, and the inequality in 
(23) follows from our choice that Cip > 2£. 

We proceed by looking at the LHS and RHS of (23) separately. Using condition I on the LHS of (23) we get that: 

Zoi(a;)+£<Z^(a;)+£. (24) 

Using condition IV on the RHS of (23) we get that: 

^ ^01 (A5i) + £■ (25) 

Combining the inequalities in (23), (24) and (25), we get that: 

Zoi (a;) <Zoi(ASi). (26) 

The statement in (26) is a contradiction of the definition of Aq^. Thus, we know that our assumption was incorrect and 
thus ||Aq^ — A^ll < Cx- We plug this into the Lipschitz condition H as follows: 

Z^ (ASi) - z^ (a;) < L||ASi - a;ii < LCx, 

Z^ {X*o^)< LCx + z^ (^X*^) . 

Thus, we have satisfied the level set condition with e = LCx. 



